1 Abstract

We use 2023 data from the New York City Police Department’s (NYPD) stop-and-frisk program. The dataset consists almost entirely of dummy and categorical variables; however, there are three continuous variables (height, weight, and age) and a timestamp variable. The dataset requires significant cleaning due to inconsistent use of NA and null entries. The timestamps require feature engineering to convert them into relevant categorical variables and to control for seasonality, and the continuous variables also require feature engineering. We aim to predict arrests using several classification methods: we fit and compare a logit model, different forms of shrinkage regression, and a random forest. We find that the random forest fits best in the test data, and it indicates that demographic factors, both of the individual and of the census tract in which they were stopped, are the most important predictors of arrest.

2 Introduction

For over 20 years, civilians in New York City (NYC) have been subject to being stopped, questioned, and searched by the police based solely on suspicion. These stops are recorded and the data made publicly available on an annual basis. We use the 2023 data, which can be found in the NYPD Stop, Question and Frisk Data. The data are collected from the 77 NYPD police precincts which span New York City’s 5 boroughs. This is a stop-level dataset: each observation (row) corresponds to a unique stop made by an NYPD police officer in 2023 as part of the SQF programme. Whenever a stop occurs under the stop-and-frisk programme, officers are required to record and classify it; therefore, every such stop during the 2023 calendar year corresponds to an observation in the dataset.

  • Target variable: Our goal is to predict arrested_flag, which is a binary indicator that equals 1 if the individual who was stopped was arrested, and 0 otherwise.1

  • Motivation: The stop-and-frisk program has been heavily criticized, as it does not require evidence to stop individuals and may allow police to use racial profiling tactics. The New York Civil Liberties Union found that the program disproportionately stops civilians of colour, in particular Black civilians, with some stops resulting in police misconduct. As the stops do not require prior evidence or warrants, we are interested in predicting which stops result in arrest. We focus particularly on factors observed before the stop begins that may predict arrest, such as civilians’ demographic characteristics, location, time of day, and attributes of the police officers. This sheds light on which factors are most important in predicting arrests, and on whether race remains an important factor once a suspect has been stopped.

  • Relevant Features and Methods: The dataset contains 82 columns, including identifying features, location features, timestamps, continuous features, and dummy variables. We keep as many features as are relevant; however, we drop identifying features, features with a high proportion of NAs, and any “excess” features that become irrelevant after feature engineering and extraction. We then recode categorical features as dummy variables. This yields 155 features in our full model and 125 in our subset model, which only considers features measurable before the stop. We compare a logit regression model, shrinkage estimators, and a random forest, exploiting in particular the ability of shrinkage estimators and random forests to deal with sparse models. We also train and tune the models on training and validation subsets, and compare their predictive performance on a held-out test set to avoid overfitting.

  • Preliminary Results: Whilst the logit model does a decent job of prediction, it is improved upon by the shrinkage estimators, of which the elastic net performs best. The random forest performs better still. This suggests that our data exhibit mild sparsity and non-linearities, which is why all the models perform fairly well but the random forest performs best. The random forest suggests that race- and wealth-based factors are the most important in predicting arrests.

3 Data Description

We begin the project by installing and loading in the necessary libraries.

# Install and load required packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
library(pacman)

p_load(readxl, dplyr, ggplot2, knitr, lubridate, tidyr, sf, httr, caret, glmnet, stringr, remotes, RColorBrewer, viridis, scales, classInt, forcats, pROC, randomForest)

We proceed by setting up the file path and importing the data from our project’s GitHub repository. We see that the raw dataset has 16971 observations and 82 features, as shown below.

# get raw content of the file 
response <- GET("https://raw.githubusercontent.com/rrachelkane/data-science-group-project/main/data/sqf-2023.xlsx")

# retrieve the .xlsx file
if (status_code(response) == 200) {
  # create a temporary file to save the downloaded content
  temp_file <- tempfile(fileext = ".xlsx")
  
  # Write the raw content to the temporary file
  writeBin(content(response, "raw"), temp_file)
  
  # Read the Excel file from the temporary file
  sqf_data <- read_xlsx(temp_file)
  
  # View the first few rows of the data
  head(sqf_data)
} else {
  stop("Failed to download the file.")
}
## # A tibble: 6 × 82
##   STOP_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2  DAY2  STOP_WAS_INITIATED
##     <dbl> <chr>           <chr>           <dbl> <chr>   <chr> <chr>             
## 1       1 2023-01-01      00:44:00         2023 January Sund… Based on Radio Run
## 2       2 2023-01-01      00:49:00         2023 January Sund… Based on Self Ini…
## 3       3 2023-01-01      05:31:00         2023 January Sund… Based on Radio Run
## 4       4 2023-01-01      04:59:00         2023 January Sund… Based on Self Ini…
## 5       5 2023-01-01      05:21:00         2023 January Sund… Based on Self Ini…
## 6       6 2023-01-01      18:00:00         2023 January Sund… Based on Radio Run
## # ℹ 75 more variables: RECORD_STATUS_CODE <chr>, ISSUING_OFFICER_RANK <chr>,
## #   ISSUING_OFFICER_COMMAND_CODE <chr>, SUPERVISING_OFFICER_RANK <chr>,
## #   SUPERVISING_OFFICER_COMMAND_CODE <chr>,
## #   SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED <chr>,
## #   LOCATION_IN_OUT_CODE <chr>, JURISDICTION_CODE <chr>,
## #   JURISDICTION_DESCRIPTION <chr>, OBSERVED_DURATION_MINUTES <dbl>,
## #   SUSPECTED_CRIME_DESCRIPTION <chr>, STOP_DURATION_MINUTES <dbl>, …
# check original dimensions
dim(sqf_data)
## [1] 16971    82

4 Data Cleaning

We then begin going through our dataset for cleaning.

4.1 Column Names

First, we convert column names from strictly uppercase to strictly lowercase, for readability and consistency.

colnames(sqf_data) <- tolower(colnames(sqf_data))

# check
colnames(sqf_data)[1:3]
## [1] "stop_id"         "stop_frisk_date" "stop_frisk_time"

4.2 Relevant Columns

Data are collected throughout the stop. Some variables, such as search_base_incident_to_arrest, are recorded only after an arrest has occurred, whilst others, such as suspect_height, are fixed before the stop begins. Still others, such as search_flag, measure an activity that may occur either before or after an arrest.

Here, we drop any variables that are clearly determined after an arrest, our outcome of interest, since they cannot be used for our prediction question. We keep the ambiguous variables for now and address them later by training and testing the models on both the “full” set of predictors and a subset that drops the variables whose temporal ordering relative to arrest is unclear.

We also drop all spatial features other than the x and y coordinates of the stop, as these are the most granular spatial information, and we use them to map each stop to a census tract to obtain demographic information for the location of the stop. The dropped spatial features also have high cardinality, which would add many dummies to our model and undermine computational efficiency.
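As an illustration of this cardinality concern, the distinct values per column can be counted before dropping; a minimal sketch, where the helper function is ours and not part of the cleaning pipeline:

```r
# helper: number of distinct values in each column of a data frame
n_distinct_cols <- function(df) sapply(df, function(x) length(unique(x)))

# applied to, e.g., the high-cardinality location columns before dropping:
# n_distinct_cols(sqf_data[, c("stop_location_street_name",
#                              "stop_location_full_address")])
```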

sqf_data <- sqf_data %>% 
  select(- c("stop_frisk_date", "record_status_code", "supervising_action_corresponding_activity_log_entry_reviewed", "stop_location_sector_code", "stop_location_apartment", "stop_location_full_address", "stop_location_patrol_boro_name", "stop_location_street_name", "suspect_other_description", "observed_duration_minutes", "stop_duration_minutes", "summons_issued_flag", "supervising_officer_command_code", "issuing_officer_command_code", "stop_location_precinct", "year2", "suspect_arrest_offense"))

# check new dim
dim(sqf_data)
## [1] 16971    65

4.3 Missing Values

First, we note that only one column contains any NA values.

na_cols <- colMeans(is.na(sqf_data)) * 100 
na_cols[na_cols > 0]
## demeanor_of_person_stopped 
##                   15.01385

The process generating the missingness of demeanor_of_person_stopped is unclear.

sqf_data[1:5, "demeanor_of_person_stopped"]
## # A tibble: 5 × 1
##   demeanor_of_person_stopped      
##   <chr>                           
## 1 YELLING AND KNOCKING ON THE DOOR
## 2 VILIGANT                        
## 3 UPSET                           
## 4 CALM                            
## 5 EVASIVE

Imputation of this would be difficult, so we drop this column.

# drop 
sqf_data <- sqf_data %>% 
  select(-("demeanor_of_person_stopped"))

# check new dim
dim(sqf_data)
## [1] 16971    64

Additionally, there are many observations in the data with the value (null) across different columns.

# get % of nulls, in columns with at least one null
null_cols <- (colMeans(sqf_data == "(null)") * 100)[colMeans(sqf_data == "(null)") * 100 > 0]

# make df for plot
null_cols_df <- data.frame(Feature = names(null_cols), Percentage = null_cols)

dim(null_cols_df)
## [1] 47  2
# order for plot
null_cols_df$Feature <- factor(null_cols_df$Feature, 
                              levels = null_cols_df$Feature[order(null_cols_df$Percentage, decreasing = FALSE)])

# plot
ggplot(null_cols_df, aes(x = Feature, y = Percentage)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
  labs(title = "Percentage of (null) Values per Column", 
       x = "Columns", 
       y = "Percentage of (null) Values") +
  coord_flip() +  # Flip coordinates
  theme_minimal()

Note, however, that not all of these (null) observations are equivalent:

  • in some columns, particularly those with lower percentages of (null) values, (null) genuinely indicates missing data: for a binary variable, for example, both “Y” and “N” appear alongside (null).
sqf_data %>% 
  group_by(ask_for_consent_flg) %>% 
  summarise(N = n()) %>% 
  kable()
ask_for_consent_flg       N
(null)                  779
N                     14187
Y                      2005
  • whereas in other cases, (null) is actually used to mean “N”.
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
sqf_data %>% 
  group_by(weapon_found_flag, firearm_flag) %>% 
  summarise(N = n()) %>% 
  kable()
## `summarise()` has grouped output by 'weapon_found_flag'. You can override using
## the `.groups` argument.
weapon_found_flag   firearm_flag       N
N                   (null)         14310
N                   Y                 28
Y                   (null)          1495
Y                   Y               1138
# note that for the identifying variables related to cops, "yes" entries are indicated unusually
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
print(unique(sqf_data$verbal_identifies_officer_flag))
## [1] "(null)" "V"
print(unique(sqf_data$shield_identifies_officer_flag))
## [1] "(null)" "S"

Note here that even though firearm_flag can be "Y" where weapon_found_flag is "N", this is not necessarily incorrect, as the officer can identify a firearm without having carried out a frisk or a search.

We deal with these cases of (null) separately:

  • we replace the second type of (null) with "N" values
# initialize empty vector
null_2 <- c()

# loop through columns
for (col in names(sqf_data)) {
  # Get unique values of the column
  unique_values <- unique(sqf_data[[col]])
  
  # Check if unique values are exactly a subset of "Y", "I", "V", "S", and "(null)"
  if (all(unique_values %in% c("Y", "I", "V", "S", "(null)")) && length(unique_values) == 2) {
    null_2 <- c(null_2, col)  # Add column name to null_2
  }
}

# check n of type 2 nulls
length(null_2)
## [1] 30
# pre-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "(null)" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "(null)" "I"
# replace these nulls with Ns
sqf_data <- sqf_data %>%
  mutate(across(all_of(null_2), ~ ifelse(. == "(null)", "N", .)))

# post-clean check examples
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
  • now replace the first type with actual NA values:
# initialize empty vector
null_1 <- c()

# loop through columns
for (col in names(sqf_data)) {
  
  # for columns not in null_2
  if (!(col %in% null_2)) {
    # if "(null)" is present in the column
    if ("(null)" %in% sqf_data[[col]]) {
      null_1 <- c(null_1, col)  # add column name to the vector
    }
  }
}

# check length
length(null_1)
## [1] 17
# pre-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N"      "Y"      "(null)"
# replace these with NAs
sqf_data <- sqf_data %>%
  mutate(across(all_of(null_1), ~ ifelse(. == "(null)", NA, .)))

# post-clean check
print(unique(sqf_data$ask_for_consent_flg))
## [1] "N" "Y" NA

Now, we calculate the percentage of genuinely missing values, which are now correctly encoded as NA:

# get % of NAs, in columns with at least one NA
na_cols <- (colMeans(is.na(sqf_data)) * 100)[colMeans(is.na(sqf_data)) * 100 > 0]

# make df for plot
na_cols_df <- data.frame(Feature = names(na_cols), Percentage = na_cols)

# order for plot
na_cols_df$Feature <- factor(na_cols_df$Feature, 
                              levels = na_cols_df$Feature[order(na_cols_df$Percentage, decreasing = FALSE)])

# plot
ggplot(na_cols_df, aes(x = Feature, y = Percentage)) +
  geom_bar(stat = "identity", fill = "#F8566D", color = "black") +
  labs(title = "Percentage of NA Values per Column", 
       x = "Columns", 
       y = "Percentage of NA Values") +
  coord_flip() +  # Flip coordinates
  theme_minimal()

Given the above, we

  • drop columns where more than 25% of observations are missing
sqf_data <- sqf_data %>% 
  select(-all_of(names(na_cols[na_cols > 25])))

dim(sqf_data)
## [1] 16971    59
  • drop the remaining observations where there are missing values
sqf_data <- sqf_data %>%
  filter(!if_any(everything(), is.na))

dim(sqf_data)
## [1] 12095    59

4.4 Coding of Binary Variables

We change the coding of binary variables:

# pre check
print(unique(sqf_data$firearm_flag))
## [1] "N" "Y"
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] "N" "I"
# clean Ys and Ns and set as numeric
sqf_data <- sqf_data %>%
  mutate(across(
    where(~ all(. %in% c("Y", "N", "I", "V", "S"))), 
    ~ as.numeric(ifelse(. %in% c("Y", "I", "V", "S"), 1, 0))
  ))

# post check
print(unique(sqf_data$firearm_flag))
## [1] 0 1
print(unique(sqf_data$id_card_identifies_officer_flag))
## [1] 0 1

4.5 Feature Transformation/Engineering

We also bin the time of the stop from stop_frisk_time:

sqf_data <- sqf_data %>%
  mutate(
    time_of_day = case_when(
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("00", "01", "02", "03", "04", "05") ~ "Late Night",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("06", "07", "08", "09", "10", "11") ~ "Morning",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("12", "13", "14", "15", "16", "17") ~ "Afternoon",
      str_extract(stop_frisk_time, "^\\d{2}") %in% c("18", "19", "20", "21", "22", "23") ~ "Evening",
      TRUE ~ NA_character_
    ),
    time_of_day = factor(time_of_day, levels = c("Late Night", "Morning", "Afternoon", "Evening"))
  )

# check
print(table(sqf_data$time_of_day))
## 
## Late Night    Morning  Afternoon    Evening 
##       2807        909       2753       5626
# now drop stop frisk time as we will just use time_of_day
sqf_data <- sqf_data %>%
  select(-"stop_frisk_time")

We also convert other relevant variables to factors as needed:

# convert character columns to factors, except for stop location x and y
sqf_data <- sqf_data %>%
  mutate(across(
    .cols = where(is.character) & !c("stop_location_x", "stop_location_y"),
    .fns = as.factor
  ))

Next, we convert suspect_height, suspect_weight, and suspect_reported_age to numeric, taking care with factor levels so that R converts the recorded values rather than the internal level codes. We then address outliers in the data.
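The factor pitfall we guard against here is that as.numeric() applied directly to a factor returns the internal level codes rather than the recorded values; converting to character first avoids this. A toy illustration (not part of the pipeline):

```r
# toy factor standing in for a column such as suspect_weight
f <- factor(c("160", "180", "160"))

as.numeric(f)                # level codes: 1 2 1 (wrong)
as.numeric(as.character(f))  # recorded values: 160 180 160 (correct)
```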

# define convert factor to numeric, handling non-numeric entries
convert_to_numeric <- function(x) {
  # convert to character to avoid factor levels issues
  x <- as.character(x)
 
  # strip non-numeric characters; entries with no digits (e.g., "unknown") become NA on conversion
  x <- gsub("[^0-9.]", "", x)
 
  # convert to numeric
  as.numeric(x)
}

# apply the function to the relevant columns
sqf_data <- sqf_data %>%
  mutate(
    suspect_reported_age = convert_to_numeric(suspect_reported_age),
    suspect_height = convert_to_numeric(suspect_height),
    suspect_weight = convert_to_numeric(suspect_weight)
  )

# check to make sure successful
summary(sqf_data$suspect_reported_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.00   25.00   27.77   34.00  118.00
summary(sqf_data$suspect_height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   5.400   5.700   5.649   5.900   7.600
summary(sqf_data$suspect_weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   150.0   160.0   166.3   180.0   760.0

Before addressing outliers in the three numerical features, we first need to transform the height variable as it is currently in the format of feet.inches. This creates an issue as an observation of 5.10 means 5 feet 10 inches, but 5.1 means 5 feet 1 inch. We will transform the inches into feet so each observation is only in feet.

Note that the raw data contain values such as 5.90, which really means 5 feet 9 inches, not 5 feet 90 inches. To handle this during the conversion, we check whether the fractional part prints with one or two decimal digits and treat it accordingly. We also verify that every extracted inches value lies between 0 and 11.9, to catch anything converted incorrectly.

# Function to convert feet.inches to feet
convert_to_feet <- function(feet_inches) {
  # Extract feet (integer part)
  feet <- floor(feet_inches)
  
  # Get the fractional part
  fractional_part <- feet_inches - feet
  
  # Interpret inches based on the fractional part
  if (fractional_part == 0) {
    # No fractional part means no additional inches
    inches <- 0
  } else {
    # Convert fractional part to a string to check its length
    fractional_str <- as.character(fractional_part)
    
    if (grepl("\\.\\d$", fractional_str)) {
      # Case like `.1`: Single digit after decimal -> 1, 2, ..., 9 inches
      inches <- fractional_part * 10
    } else if (grepl("\\.\\d0$", fractional_str)) {
      # Case like `.10`, `.20`, etc.: Two digits ending in `0` -> 10, 20, etc. inches
      inches <- fractional_part * 100
    } else {
      # Case like `.11`, `.12`, etc.: Two digits not ending in `0` -> Exact inches
      inches <- fractional_part * 100
    }
  }
  
  # Validate inches (should be between 0 and 11)
  if (inches < 0 || inches > 11.9) {
    warning(paste("Invalid height input:", feet_inches, "- Inches must be between 0 and 11.9. Returning NA."))
    return(NA) # not triggered in our data, so all inches were extracted correctly
  }
  
  # Convert inches to feet
  inches_in_feet <- inches / 12
  
  # Return total height in feet
  return(feet + inches_in_feet)
}

# Apply the conversion function to the 'suspect_height' column
sqf_data$suspect_height_feet <- sapply(sqf_data$suspect_height, convert_to_feet)

# Check the result
tail(sqf_data[, c("suspect_height", "suspect_height_feet")])
## # A tibble: 6 × 2
##   suspect_height suspect_height_feet
##            <dbl>               <dbl>
## 1           5.7                 5.58
## 2           5.11                5.92
## 3           5.9                 5.75
## 4           5.8                 5.67
## 5           6.2                 6.17
## 6           5.8                 5.67
sqf_data$suspect_height <- sqf_data$suspect_height_feet

#drop suspect_height_feet
sqf_data <- sqf_data %>% dplyr::select(-suspect_height_feet)

We now plot the density of height in feet to see the distribution and identify outliers. We are going to drop observations below 4 ft and above 7 ft.

# compute density for suspect_height
height_density <- density(sqf_data$suspect_height, na.rm = TRUE)

# plot the density
plot(height_density,
     main = "Density of Height with Outliers Highlighted",
     xlab = "Height",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (e.g., heights below 4 feet and above 7 feet)
outliers <- sqf_data$suspect_height[sqf_data$suspect_height < 4 | sqf_data$suspect_height > 7]

# add vertical lines to mark the outlier boundaries
abline(v = c(4, 7), col = "darkorchid2", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkorchid2", pch = 19)

# drop outlier observations where height is above 7 ft and below 4 ft.
sqf_data <- sqf_data[sqf_data$suspect_height >= 4 & sqf_data$suspect_height <= 7, ]

We plot the density of age to see the distribution. Since there are some outliers, we drop any observations where age is below 10 or above 85. We also notice that the suspect_reported_age data are right-skewed, so we apply a logarithmic transformation to this variable.

# compute density for reported_age
reported_age_density <- density(sqf_data$suspect_reported_age, na.rm = TRUE)

# plot the density
plot(reported_age_density,
     main = "Density of Reported Age with Outliers Highlighted",
     xlab = "Reported Age",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (ages < 10 and > 85)
outliers <- sqf_data$suspect_reported_age[sqf_data$suspect_reported_age < 10 | sqf_data$suspect_reported_age > 85]

# add vertical lines to mark the outlier boundaries
abline(v = c(10, 85), col = "deeppink", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "deeppink", pch = 19)

# drop outlier observations where age is below 10 or above 85
sqf_data <- sqf_data[sqf_data$suspect_reported_age >= 10 & sqf_data$suspect_reported_age <= 85, ]

# since the suspect_reported_age data is quite skewed, we perform a logarithmic transformation
sqf_data$suspect_reported_age <- log(sqf_data$suspect_reported_age)

We plot the density of weight to identify outliers. The weight variable is measured in pounds, and some observations appear to be data entry errors (for example, 9, 16, and 760). We drop outliers below 90 lbs and above 300 lbs.

# compute density for suspect_weight
weight_density <- density(sqf_data$suspect_weight, na.rm = TRUE)

# plot the density
plot(weight_density,
     main = "Density of Suspect Weight with Outliers Highlighted",
     xlab = "Weight",
     ylab = "Density",
     col = "black",
     lwd = 2)
grid()

# identify and highlight outliers (e.g., weights below 90 lbs and above 300 lbs)
outliers <- sqf_data$suspect_weight[sqf_data$suspect_weight < 90 | sqf_data$suspect_weight > 300]

# add vertical lines to mark the outlier boundaries
abline(v = c(90, 300), col = "darkturquoise", lty = 2, lwd = 2)

# highlight the outlier data points on the plot
points(outliers, rep(0, length(outliers)), col = "darkturquoise", pch = 19)

# drop outlier observations where weight is below 90 lbs or above 300 lbs
sqf_data <- sqf_data[sqf_data$suspect_weight >= 90 & sqf_data$suspect_weight <= 300, ]

Finally, we standardize our numerical variables, as they are measured on different scales. This allows us to better compare the magnitudes of their coefficients when assessing predictive importance.

# standardize the numeric variables
sqf_data <- sqf_data %>%
  mutate(
    suspect_reported_age = scale(suspect_reported_age),
    suspect_height = scale(suspect_height),
    suspect_weight = scale(suspect_weight)
  )

# check if the standardization worked by summarizing the variables
summary(sqf_data$suspect_reported_age)
##        V1         
##  Min.   :-2.3918  
##  1st Qu.:-0.7609  
##  Median :-0.0636  
##  Mean   : 0.0000  
##  3rd Qu.: 0.7177  
##  Max.   : 2.8599
summary(sqf_data$suspect_height)
##        V1          
##  Min.   :-5.25715  
##  1st Qu.:-0.56747  
##  Median :-0.07382  
##  Mean   : 0.00000  
##  3rd Qu.: 0.41983  
##  Max.   : 3.62856
summary(sqf_data$suspect_weight)
##        V1         
##  Min.   :-2.4893  
##  1st Qu.:-0.5337  
##  Median :-0.2078  
##  Mean   : 0.0000  
##  3rd Qu.: 0.4440  
##  Max.   : 4.3551

5 Exploratory Data Analysis

5.1 Basic Summary Statistics

We first look at some basic summary statistics.

# check dim again
dim(sqf_data)
## [1] 12044    59
# tabulation of the dependent variable 
sqf_data %>% 
  group_by(suspect_arrested_flag) %>% 
  summarise(N = n(),
            Pc = N / nrow(sqf_data) * 100) %>% 
  arrange(desc(N)) %>% 
  kable(booktabs = TRUE, col.names = c("Suspect Arrested", "N Stops", "% Total Stops"), align = "l")
Suspect Arrested   N Stops   % Total Stops
0                  8270      68.6649
1                  3774      31.3351
# looking at the distribution of sex
ggplot(sqf_data, aes(x = suspect_sex, fill = suspect_sex)) +
  geom_bar() +
  labs(
    title = "Distribution of Suspect Sex", 
    x = "Sex", 
    y = "N Stops"
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c("MALE" = "lightblue", "FEMALE" = "pink") 
  ) +
  theme(legend.position = "none")

# sex by arrest status
ggplot(sqf_data, aes(x = suspect_sex, fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  labs(title = "Distribution of Arrest Status, By Suspect Sex", 
       x = "Sex", 
       y = "% Stops") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(type = "qual", palette = "Pastel2", name = "Suspect Arrested") 

We see that the vast majority of stops are conducted on men; however, once stopped, women are arrested at a slightly higher rate than men.

# empirical cdf of age by sex and arrest status
ggplot(sqf_data, aes(x = suspect_reported_age, color = factor(suspect_arrested_flag))) +
  stat_ecdf(geom = "step") +
  facet_wrap(~ suspect_sex, ncol = 2) +
  scale_color_manual(values = c("0" = "red", "1" = "darkgreen"),
                     labels = c("Not Arrested", "Arrested"),
                     name = "Arrest Outcome") +
  labs(x = "Suspect Reported Age", y = "ECDF", title = "Empirical CDF of Suspect Reported Age, By Sex and Arrest Status") +
  theme_minimal()

The empirical CDF shows that the youngest and oldest men are more likely to be arrested when stopped, especially compared with women. This suggests that young adult men are “overstopped” relative to their likelihood of arrest. For reference, before the logarithmic transformation and standardization of suspect_reported_age, we found the mean age was 30 years for females and 27 years for males.

# arrests by race, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  xlab("Suspect Race") +
  ylab("N Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Race")

# arrests by race, stacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspect_race_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  xlab("Suspect Race") +
  ylab("% Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Race")

In terms of race, Black and Hispanic individuals account for the majority of stops and arrests; however, arrest rates do not differ substantially by race, except for Native Americans. That exception is likely an artefact of the small number of stops of Native American individuals, which makes the rate noisy.

# arrests by suspected crime description, unstacked
ggplot(data = sqf_data, aes(x = fct_rev(fct_infreq(suspected_crime_description)), fill = factor(suspect_arrested_flag))) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  xlab("Suspected Crime") +
  ylab("N Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspected Crime")

# arrests by suspected crime description, stacked
ggplot(data = sqf_data, aes(x = suspected_crime_description, fill = factor(suspect_arrested_flag))) +
  geom_bar(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  xlab("Suspected Crime") +
  ylab("% Stops") +
  scale_fill_brewer(type = "qual", palette = "Pastel1", name = "Suspect Arrested") +
  labs(title = "Distribution of Arrest Status, By Suspect Crime")

The majority of stops occur under suspicion of criminal possession of a weapon (CPW) or various forms of theft. However, arrest rates vary by suspected crime, with forcible touching and graffiti having the highest arrest rates, whilst suspected drug offences appear to have lower arrest rates.

# heatmap of arrests by time of day and borough
ggplot(
  sqf_data %>%
    group_by(stop_location_boro_name, time_of_day) %>%
    summarise(
      percent_arrested = mean(suspect_arrested_flag, na.rm = TRUE) * 100,
      .groups = "drop"
    ),
  aes(
    x = time_of_day,
    y = stop_location_boro_name,
    fill = percent_arrested
  )
) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Arrest Rate (%)") +
  labs(
    title = "Heat Map of Arrest Rates",
    x = "Time of Day",
    y = "Borough"
  ) +
  theme_minimal()

The heat map shows that the arrest rate in the Bronx is substantially lower than in Manhattan and Brooklyn, which in turn are lower than in Queens and Staten Island. We also see that arrest rates are lower in the evening. This could be because these are the times and areas where the most stops occur: the arrest rate falls because the increase in the number of stops outstrips the increase in arrests.

5.2 Spatial Data Visualisation

We also wanted to consider how stops and arrests were distributed across NYC. We first clean the data for spatial mapping using the sf and nycgeo packages.

We use this to obtain information about the stops at the census tract level, chosen for its granularity and the availability of population statistics at that level. This will be important for prediction, as it allows us to factor in how police approach specific neighbourhoods with reputations as low- or high-crime areas. “High-crime” neighbourhoods will likely see more intense policing and more stops, which will be associated with a lower arrest rate.

# drop 7 observations which have incorrect spatial info
sqf_data <- sqf_data %>% 
  filter(stop_location_x > 0)

dim(sqf_data)
## [1] 12037    59
# make spatial object for mapping
sqf_data_sf <- st_as_sf(sqf_data, 
                        coords = c("stop_location_x", "stop_location_y"), 
                        crs = 2263)  #  crs for New York (EPSG:2263)

# load in nta-level shapefile
remotes::install_github("mfherman/nycgeo")
## Skipping install of 'nycgeo' from a github remote, the SHA1 (4fee55c1) has not changed since last install.
##   Use `force = TRUE` to force installation
library(nycgeo)
nyc_tract_shp <- nycgeo::nyc_boundaries(geography = "tract", add_acs_data = TRUE)
## old-style crs object detected; please recreate object with a recent sf::st_crs()
# check crs
st_crs(nyc_tract_shp)$epsg
## [1] 2263

5.2.1 Stop-Level Maps

We then look at how stops and arrests are distributed across NYC.

# plot data onto shapefile by arrest status
ggplot() +
  geom_sf(data = nyc_tract_shp, fill = "lightblue", color = "black", size = 0.3) +
  geom_sf(data = sqf_data_sf, aes(color = as.factor(suspect_arrested_flag)), size = 0.7) +
  scale_color_manual(values = c("red", "seagreen"),  
                     labels = c("Arrested", "Not Arrested")) +
  theme_minimal() +
  labs(title = "NYC Police Stops by Arrest Status") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.title = element_blank())

It is difficult to discern clear patterns due to the high frequency of stops; however, most stops and arrests appear to occur in a corridor running through the Bronx, Manhattan and Brooklyn (particularly around the Bronx and Central Brooklyn). This matches our earlier suggestion that these areas have a lower arrest rate because they are more heavily policed.

5.2.2 Tract Level Maps

We then consider the percentage of stops resulting in arrests.

# join datasets to assign each stop to a tract
sqf_data_sf <- st_join(sqf_data_sf, nyc_tract_shp)
dim(sqf_data_sf)
## [1] 12037   104
# aggregate to tract level
sqf_data_sf_tract_level <- sqf_data_sf %>%
  filter(!is.na(geoid)) %>%
  group_by(geoid) %>%
  summarize(pc_arrest = (sum(suspect_arrested_flag) / n()) * 100)

# spatially join tract polygons with the tract-level arrest summaries for mapping
# (note: st_join performs a spatial join, so no `by` key is needed)
sqf_data_sf_tract_level <- nyc_tract_shp %>%
  st_join(sqf_data_sf_tract_level)
  
ggplot() +
  geom_sf(data = sqf_data_sf_tract_level, aes(fill = pc_arrest), color = "black", size = 0.3) +
  scale_fill_viridis_c(
    name = "% Stops Ending in Arrest",
    option = "inferno",
    na.value = "white"
  ) +
  theme_void() +
  labs(title = "Percentage of Stops Ending in Arrest by NYC Census Tract")

The lowest percentages appear in the areas where we saw the most stops occurring, that is, the Bronx and Central Brooklyn. Conversely, higher percentages occur in lower-stop boroughs like Queens and Staten Island. This suggests that the Bronx and Central Brooklyn are overpoliced relative to Queens and Staten Island, and is further evidence that more heavily policed areas have a lower arrest rate.

5.2.3 Tract-Level Population Predictors

Additionally, the nycgeo package allows us to link in census tract level American Community Survey (2013-2017) population statistics.

We use this to see how demographic factors, like race, are distributed across NYC, and whether they correlate with where stops and arrests occur.

We visualize those here:

# we first create the new variables in the shapefile for ease of plotting
nyc_tract_shp <- nyc_tract_shp %>%
  mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
         pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
         pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
         pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
         pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
  )
# non hispanic black
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_black_est)) +
  scale_fill_viridis_c(
    name = "Non-Hispanic Black % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Non-Hispanic Black Population by Census Tract, ACS 2013-2017")

# hispanic any
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_hisp_est)) +
  scale_fill_viridis_c(
    name = "Hispanic Any Race % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Hispanic Any Race Population by Census Tract, ACS 2013-2017")

# non-hispanic asian
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_asian_est)) +
  scale_fill_viridis_c(
    name = "Non-hispanic Asian % Total Population",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "Non-hispanic Asian Population by Census Tract, ACS 2013-2017")

# pop age 25 years or older with at least bachelors degree
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_ba_above_est)) +
  scale_fill_viridis_c(
    name = "% Population Aged >= 25 with at Least Bachelors Degree",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "% Population Aged >= 25 with at least a Bachelor's Degree by Census Tract, ACS 2013-2017")

# population with income below the poverty line
ggplot(nyc_tract_shp) +
  geom_sf(aes(fill = pc_pop_inpov_est)) +
  scale_fill_viridis_c(
    name = "% Population With Income Below Poverty Line",
    option = "inferno"
  ) +
  theme_void() +
  labs(title = "% Population with Income Below Poverty Line, ACS 2013-2017")

We see that Black populations are mostly found in the Bronx, Central Brooklyn, and South Queens; Hispanic populations in the Bronx and North Queens; Asian populations in South Manhattan and North Queens; college graduates in Manhattan; and poorer communities in North Manhattan, the Bronx, and Central Brooklyn.

This suggests that the most intense presence of the NYPD stop-and-frisk programme occurs in poorer neighbourhoods disproportionately inhabited by Black and Hispanic residents, which is evidence of potential racial bias. However, to be more certain of this we would need data on underlying crime rates.

We keep only these predictors from the ACS data as predictors for our analysis, as shown below.

# now we create the new variables in the sqf_data_sf object, before merging as appropriate by stop_id into the original data
sqf_data_sf <- sqf_data_sf %>%
  mutate(pc_pop_black_est = (pop_black_est / pop_total_est) * 100,
         pc_pop_hisp_est = (pop_hisp_est / pop_total_est) * 100,
         pc_pop_asian_est = (pop_asian_est / pop_total_est) * 100,
         pc_pop_ba_above_est = (pop_ba_above_est / pop_total_est) * 100,
         pc_pop_inpov_est = (pop_inpov_est / pop_total_est) * 100
  )

# check current dim
dim(sqf_data)
## [1] 12037    59
sqf_data <- sqf_data %>%
  # left join selected spatial features from the sf object into sqf_data
  left_join(sqf_data_sf %>% select(stop_id, pc_pop_ba_above_est, pc_pop_inpov_est, pc_pop_asian_est, pc_pop_hisp_est, pc_pop_black_est), by = "stop_id") %>%
  # drop x,y coords and geometry as we use census tract for spatial info
  select(-c("stop_location_x", "stop_location_y", "geometry")) %>%
  # drop obs with missing values in these spatial features
  filter(!if_any(everything(), is.na))
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# check new dim
dim(sqf_data)
## [1] 11957    62

6 Regression Analysis

In order to predict arrest status in the SQF data (that is, arrest status conditional on being stopped as part of the SQF programme), we begin by estimating a logistic regression model.

To avoid overfitting issues and to assess out of sample predictive performance, we split the data into training, test, and, when appropriate, validation samples.

We perform this below:

# set seed for reproducibility
set.seed(1)

# set y and X matrix
y <- sqf_data$suspect_arrested_flag
X <- model.matrix(~ . - suspect_arrested_flag - stop_id - search_basis_incidental_to_arrest_flag, data = sqf_data)

# perform train-test split
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
y_train <- y[train_index]
y_test <- y[-train_index]
X_train <- X[train_index,]
X_test <- X[-train_index, ]

# print lengths and dimensions
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 8370
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 3587
cat("Dimensions of X_train:", dim(X_train), "\n")
## Dimensions of X_train: 8370 157
cat("Dimensions of X_test:", dim(X_test), "\n")
## Dimensions of X_test: 3587 157
# check balance of y
print(table(y_train))
## y_train
##    0    1 
## 5752 2618
print(table(y_test))
## y_test
##    0    1 
## 2465 1122

For our logistic, lasso and ridge estimators, we split our sample into a training sample, and a test sample at a ratio of 70:30, which is 8370:3587 at the observation level.

Note that, for each model type, we estimate using two sets of matrices of predictors to test sensitivity of prediction accuracy:

  • the first (X_train and X_test) simply drops stop_id, due to its irrelevance for prediction, and also search_basis_incidental_to_arrest_flag, as this indicates a search was carried out after an arrest.

  • the second set (X_train_subset and X_test_subset) also drops all further variables related to search, frisk and physical force. We drop these to take a conservative approach to prediction, as it is unclear when they occur relative to the arrest. We are more interested in estimating the ex-ante likelihood of a stop ending in arrest; namely, the likelihood of a stop ending in arrest, conditional only on the information that a police officer has available to them before the stop begins.

We create this subset below:

# set x subset, removing anything that is/might be realized during the stop
X_subset <- X[, !grepl("^(search|physical_force|.*eye_color|.*hair_color)", colnames(X)) &
                 !colnames(X) %in% c("frisked_flag", "firearm_flag", "knife_cutter_flag",
                                     "other_weapon_flag", "weapon_found_flag", "other_contraband_flag")]


# perform train-test split
X_train_subset <- X_subset[train_index, ]
X_test_subset <- X_subset[-train_index, ]

cat("Dimensions of X_train_subset:", dim(X_train_subset), "\n")
## Dimensions of X_train_subset: 8370 114
cat("Dimensions of X_test_subset:", dim(X_test_subset), "\n")
## Dimensions of X_test_subset: 3587 114

Next, note that we also require a validation dataset for our tuning of hyperparameters in our tuned elastic net and random forest models.

For this reason, we split X_train_subset further into:

  • X_train_subset_final
  • X_val_subset

and correspondingly, y_train further into:

  • y_train_final
  • y_val

We only do this for the subset, as for computational reasons we tune these models on the subset alone.2

We create these matrices here:

# subset training data for validation split
n_train <- nrow(X_train)
id_split <- sample(1:n_train, floor(0.5 * n_train))

# split into training and validation sets
X_train_subset_final <- X_train_subset[id_split, ]
y_train_final <- y_train[id_split]
X_val_subset <- X_train_subset[-id_split, ]
y_val <- y_train[-id_split]

# print dimensions
cat("dimensions of X_train_final:", dim(X_train_subset_final), "\n")
## dimensions of X_train_final: 4185 114
cat("dimensions of y_train_final:", length(y_train_final), "\n")
## dimensions of y_train_final: 4185
cat("dimensions of X_val:", dim(X_val_subset), "\n")
## dimensions of X_val: 4185 114
cat("dimensions of y_val:", length(y_val), "\n")
## dimensions of y_val: 4185

We now proceed by running our regression models, evaluating each of them based on their predictive performance in the test data.

Note that, as we are working in a binary classification setting, we are looking to minimize misclassification error. The cost ratio in this framework is \[C = \frac{L(1,0)}{L(0,1)},\] where \(L(1,0)\) is the loss from a false negative (predicting no arrest when an arrest occurred) and \(L(0,1)\) the loss from a false positive.

This measures how costly false negatives are relative to false positives. The optimal decision rule is \[\hat{y}_i = 1 \iff p_i > \frac{1}{1+C}.\]

When looking at our confusion matrices, we choose \(C=1\), weighting both types of misclassification equally, and so classify according to the most likely class, i.e. \[p_i > 0.5 \implies \hat{y}_i = 1.\] We do this despite caring more about correctly predicting arrests. Our preferred model will be the one with the highest Area Under the ROC Curve (AUC), which accounts for all potential values of \(C\).
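The decision rule above can be sketched in base R. The helper name below is illustrative only and is not part of our pipeline:

```r
# classify fitted probabilities given a cost ratio C = L(1,0) / L(0,1);
# the optimal rule predicts 1 whenever p exceeds 1 / (1 + C)
classify_with_cost <- function(p, C = 1) {
  threshold <- 1 / (1 + C)
  as.integer(p > threshold)
}

p_hat <- c(0.10, 0.40, 0.60, 0.90)

# C = 1: symmetric costs, threshold 0.5
classify_with_cost(p_hat, C = 1)  # 0 0 1 1

# C = 3: false negatives three times as costly, threshold 0.25,
# so more stops are classified as ending in arrest
classify_with_cost(p_hat, C = 3)  # 0 1 1 1
```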

6.1 Logistic Regression

Now that the data is prepared, we specify the data for our logit model, which will serve as a simple baseline. As it is a simple model, we use X_train_subset: this is the more conservative approach, since it drops the variables for which it is unclear whether they occurred before or after the arrest.

We use a logit model because it handles the classification structure of predicting a binary variable. However, it may struggle with potential sparsity in the data, and with multicollinearity arising from the lack of variation in some of the dummy variables.

Since the model is sensitive to multicollinearity, we also drop the factor variable supervising_officer_rank, as it is closely related to issuing_officer_rank: issuing officers of a specific rank likely have supervisors of the same rank.

# Set X_logit, remove supervising_officer_rank to avoid rank-deficient fit 
X_logit <- X_subset[, !grepl("^(supervising_officer_rank)", colnames(X_subset))]
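The exact collinearity we are guarding against can be sanity-checked by comparing the numerical rank of a design matrix with its column count. A base-R sketch on a toy matrix (not our actual design matrix):

```r
# toy design matrix: x3 is an exact linear combination of x1 and x2,
# so the matrix is rank-deficient and a logit fit would alias a coefficient
X_toy <- cbind(
  x1 = c(1, 2, 3, 4),
  x2 = c(0, 1, 0, 1),
  x3 = c(1, 3, 3, 5)  # x3 = x1 + x2
)

qr(X_toy)$rank                # 2, not 3
qr(X_toy)$rank < ncol(X_toy)  # TRUE: perfect multicollinearity present
```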

Now we set the training and testing split using the X_logit data and run our model. We generate a confusion matrix, and compute the ROC and AUC.

# perform train-test split on X_logit
X_train_logit <- X_logit[train_index, ]
X_test_logit <- X_logit[-train_index, ]

# print lengths and dimensions
cat("Dimensions of X_train_logit:", dim(X_train_logit), "\n")
## Dimensions of X_train_logit: 8370 101
cat("Dimensions of X_test_subset:", dim(X_test_logit), "\n")
## Dimensions of X_test_subset: 3587 101
# glm requires dataframes as arguments
X_train_logit_df <- as.data.frame(X_train_logit)
X_test_logit_df <- as.data.frame(X_test_logit)

# full logit model in training data
logit <- glm(y_train ~ ., data = X_train_logit_df, family = binomial(logit))

# get fitted probabilities using trained model on test data
logit_predict <- as.numeric(predict(logit, newdata = X_test_logit_df, type = "response"))

# convert probabilities to class predictions using a threshold of 0.5
logit_y_hat <- ifelse(logit_predict > 0.5, 1, 0)

# inspect class balance of predicted classes
table(logit_y_hat)  
## logit_y_hat
##    0    1 
## 2828  759
# define a function to generate and plot confusion matrix
generate_cm <- function(true, predicted, title) {
   # gen confusion matrix as a data frame; note that Freq is normalised
   # within each predicted class, so Predicted_pct gives the share of each
   # predicted class that falls in each true class
   cm <- as.data.frame(table(True = true, Predicted = predicted)) %>%
   group_by(Predicted) %>%
   mutate(Predicted_pct = Freq / sum(Freq))
 
  # print cm
  print(cm)
 
  # plot cm
  plot <- ggplot(data = cm, mapping = aes(x = ordered(True, c(1, 0)), y = Predicted, fill = Predicted_pct)) +
    geom_tile() +
    geom_text(aes(label = round(Predicted_pct, 2)), color = 'white') +
    scale_fill_gradient(low = "blue", high = "red", name = "Rel. Freq.") +
    xlab("True") +
    ylab("Predicted") +
    labs(title = title) +
    theme_minimal()
 
  # print plot
  print(plot)
  
}

# plot confusion matrix for logistic regression with all variables
generate_cm(y_test, logit_y_hat, "Confusion Matrix for Logistic Regression")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2192         0.775
## 2 1     0           636         0.225
## 3 0     1           273         0.360
## 4 1     1           486         0.640

# compute ROC and AUC
logit_roc <- roc(response = y_test, predictor = logit_predict, quiet = TRUE)  
logit_auc <- round(auc(logit_roc), 4)  
cat("AUC for the logit model", logit_auc, "\n")
## AUC for the logit model 0.7751

Our confusion matrix shows that 78% of predicted non-arrests and 64% of predicted arrests are correct (the percentages are normalised within each predicted class). We also have an AUC of 0.775 to 3 significant figures. We care more about predicting arrests, so performance on the predicted-arrest class matters more to us at equivalent AUC. However, even though the model is relatively weaker at predicting arrests, it is still surprisingly good given the weaknesses of logit under sparsity and multicollinearity. This suggests that whilst sparsity may be a problem, it may not be as serious an issue as we previously thought.

6.2 LASSO

Next we implement the LASSO, which for our logistic model solves \[ \hat{\beta}^{LASSO} = \arg\min_{\beta} \; -\sum_{i=1}^n \Big[ y_i x_i' \beta - \log\big(1 + e^{x_i' \beta}\big) \Big] \] subject to \[ \sum_{j=1}^{p} |\beta_j| \le t. \]

We implement cross-validation in our training sample so that each datapoint is used both for fitting and for validation, which gives a more robust estimate of our model’s performance. We use the default \(k=10\) folds for computational efficiency.

As we are operating in a classification setting, we choose the value of the parameter lambda that maximises the area under the ROC curve, and implement this by selecting type.measure = "auc" in our cv.glmnet regressions.
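The fold assignment that cv.glmnet performs internally can be sketched in base R; the construction below is for intuition only and is not how we call the function:

```r
set.seed(1)
k <- 10
n <- 8370  # size of our training sample

# randomly assign each observation to one of k (roughly) equal folds;
# each fold is held out once whilst the remaining k - 1 folds fit the model
folds <- sample(rep(1:k, length.out = n))

table(folds)  # 837 observations per fold, since 8370 is divisible by 10
```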

First we run our full model.

# run lasso on training data to collect coefficients
lasso <- cv.glmnet(x=X_train, y=y_train, alpha = 1, family="binomial", type.measure = "auc")

# optimal lambda
lasso_lambda_min <- lasso$lambda.min
cat("Optimal Lambda for Lasso:", lasso_lambda_min, "\n")
## Optimal Lambda for Lasso: 0.002825922
# plot auc against log lambda
plot(lasso)
abline(v = log(lasso_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation AUC (LASSO - Full)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "AUC")

# plot coefficients
plot(lasso$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Full Model)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities, using best lambda
lasso_predict <- as.numeric(predict(lasso, s = lasso_lambda_min, X_test, type = "response"))

# add variable for predicted classes
lasso_y_hat <- ifelse(lasso_predict > 0.5, 1, 0)

# plot confusion matrix for lasso regression out of sample
generate_cm(y_test, lasso_y_hat, "Confusion Matrix for LASSO (Full Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2352        0.908 
## 2 1     0           237        0.0915
## 3 0     1           113        0.113 
## 4 1     1           885        0.887

# compute ROC and AUC
lasso_roc_full <- roc(response = y_test, predictor = lasso_predict, quiet = TRUE)
lasso_auc_full <- round(auc(lasso_roc_full), 4)  
cat("AUC for the LASSO (full) model", lasso_auc_full, "\n")
## AUC for the LASSO (full) model 0.9462

We find an optimal lambda of 0.00283 to 3 significant figures.

The confusion matrix shows that 91% of predicted non-arrests and 89% of predicted arrests are correct, with an AUC of 0.946 to 3 significant figures.

6.2.1 LASSO Sensitivity Analysis - Subset

We then run the LASSO analysis for the subset. This is most comparable to our simple logit regression in the data it uses.

# run lasso on training data to collect coefficients
lasso_subset <- cv.glmnet(x=X_train_subset, y=y_train, alpha = 1, family="binomial", type.measure = "auc")

lasso_lambda_min_subset <- lasso_subset$lambda.min
cat("Optimal Lambda for Lasso (Subset):", lasso_lambda_min_subset, "\n")
## Optimal Lambda for Lasso (Subset): 0.001751243
# plot auc against log lambda
plot(lasso_subset)
abline(v = log(lasso_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation AUC (LASSO - Subset)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "AUC")

# plot coefficients
plot(lasso_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for LASSO (Subset)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
lasso_predict_subset <- as.numeric(predict(lasso_subset, s = lasso_lambda_min_subset, X_test_subset, type = "response"))

# add variable for predicted classes
lasso_y_hat_subset <- ifelse(lasso_predict_subset > 0.5, 1, 0)

# plot confusion matrix for lasso subset regression out of sample
generate_cm(y_test, lasso_y_hat_subset, "Confusion Matrix for LASSO (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2226         0.769
## 2 1     0           667         0.231
## 3 0     1           239         0.344
## 4 1     1           455         0.656

# compute roc object for subset model
lasso_roc_subset <- roc(response = y_test, predictor = lasso_predict_subset, quiet = TRUE)
lasso_auc_subset <- round(auc(lasso_roc_subset), 4)  
cat("AUC for the LASSO (subset) model", lasso_auc_subset, "\n")
## AUC for the LASSO (subset) model 0.7771

We find an optimal lambda of 0.00175 to 3 significant figures.

The confusion matrix shows that 77% of predicted non-arrests and 66% of predicted arrests are correct, with an AUC of 0.777 to 3 significant figures. This is a modest improvement on the logit model, which indicates that the LASSO deals with the mild multicollinearity and sparsity issues slightly more effectively. However, it also suggests that these issues are not particularly important once we drop perfectly multicollinear variables from the logit model.

We find that the model performs considerably worse once these features are dropped. What is particularly striking is the fall in performance on the predicted-arrest class. This suggests the model becomes much less able to predict arrests when we drop variables observed during the course of the stop, and potentially after the arrest. There are two potential interpretations: either some of these variables are recorded after the arrest and hence mechanically predict it, or behaviour and suspicions arising during the stop lead officers to take further measures which uncover grounds for arrest. A combination of the two is the most likely explanation, so the improvements in the full model are likely only partly genuine ex-ante predictive improvements.

However, for the purposes of comparison we will principally compare the subset regressions.

6.3 Ridge

Next, we implement ridge regression, which for our logistic model solves \[ \hat{\beta}^{Ridge} = \arg\min_{\beta} \; -\sum_{i=1}^n \Big[ y_i x_i' \beta - \log\big(1 + e^{x_i' \beta}\big) \Big] \] subject to \[ \sum_{j=1}^{p} \beta_j^2 \le t. \]

This differs from the LASSO in that ridge shrinks coefficients towards zero but never exactly to zero.
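The contrast is easiest to see for the least-squares versions of the two penalties under an orthonormal design, where both estimators have closed forms: the LASSO soft-thresholds each OLS coefficient, whilst ridge only rescales it. A standalone base-R illustration (not part of our estimation pipeline):

```r
# closed-form shrinkage of OLS coefficients b under an orthonormal design
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
ridge_shrink <- function(b, lambda) b / (1 + lambda)

b_ols <- c(-2.0, -0.3, 0.1, 1.5)

lasso_shrink(b_ols, lambda = 0.5)  # -1.5  0.0  0.0  1.0: small coefficients hit exactly zero
ridge_shrink(b_ols, lambda = 0.5)  # all shrunk towards zero, none exactly zero
```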

We first run the full ridge regression:

# run ridge regression on training data to collect coefficients
ridge <- cv.glmnet(x = X_train, y = y_train, alpha = 0, family = "binomial", type.measure = "auc")

ridge_lambda_min <- ridge$lambda.min
cat("optimal lambda for ridge:", ridge_lambda_min, "\n")
## optimal lambda for ridge: 0.02457843
# plot auc against log lambda
plot(ridge)
abline(v = log(ridge_lambda_min), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation AUC (Ridge - Full)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "AUC")

# plot coefficients
plot(ridge$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Full Model)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
ridge_predict <- as.numeric(predict(ridge, s = ridge_lambda_min, X_test, type = "response"))

# add variable for predicted classes
ridge_y_hat <- ifelse(ridge_predict > 0.5, 1, 0)

# plot confusion matrix for ridge regression out of sample
generate_cm(y_test, ridge_y_hat, "Confusion Matrix for Ridge (Full Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2357        0.903 
## 2 1     0           253        0.0969
## 3 0     1           108        0.111 
## 4 1     1           869        0.889

# compute roc object for full ridge model
ridge_roc_full <- roc(response = y_test, predictor = ridge_predict, quiet = TRUE)
ridge_auc_full <- round(auc(ridge_roc_full), 4)
cat("AUC for the Ridge (full) model:", ridge_auc_full, "\n")
## AUC for the Ridge (full) model: 0.9464

We find an optimal lambda of 0.0246 to 3 significant figures for our full ridge regression.

The confusion matrix for our full ridge shows that 90% of predicted non-arrests and 89% of predicted arrests are correct, with an AUC of 0.946 to 3 significant figures. This suggests that the ridge model is marginally better than the LASSO, as it predicts arrests slightly more accurately at the cost of slightly more false positives.

6.3.1 Ridge Sensitivity Analysis - Subset

We then run our ridge regression over our subset matrix.

# run ridge regression on subset predictors
ridge_subset <- cv.glmnet(x = X_train_subset, y = y_train, alpha = 0, family = "binomial", type.measure = "auc")

ridge_lambda_min_subset <- ridge_subset$lambda.min
cat("Optimal lambda for ridge (subset):", ridge_lambda_min_subset, "\n")
## Optimal lambda for ridge (subset): 0.01264539
# plot auc against log lambda
plot(ridge_subset)
abline(v = log(ridge_lambda_min_subset), col = "red", lwd = 2, lty = 2)
title(main = "Cross-Validation AUC (Ridge - Subset)",
      sub = "Optimal Lambda Highlighted in Red",
      xlab = "Log(Lambda)", ylab = "AUC")

# plot coefficients
plot(ridge_subset$glmnet.fit, xvar = "lambda", label = TRUE)
title(main = "Coefficient Shrinkage Path for Ridge (Subset)",
      xlab = "Log(Lambda)", ylab = "Coefficients")

# get fitted probabilities
ridge_predict_subset <- as.numeric(predict(ridge_subset, s = ridge_lambda_min_subset, X_test_subset, type = "response"))

# add variable for predicted classes
ridge_y_hat_subset <- ifelse(ridge_predict_subset > 0.5, 1, 0)

# plot confusion matrix for ridge subset regression out of sample
generate_cm(y_test, ridge_y_hat_subset, "Confusion Matrix for Ridge (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2221         0.772
## 2 1     0           657         0.228
## 3 0     1           244         0.344
## 4 1     1           465         0.656

# compute roc object for subset ridge model
ridge_roc_subset <- roc(response = y_test, predictor = ridge_predict_subset, quiet = TRUE)
ridge_auc_subset <- round(auc(ridge_roc_subset), 4)
cat("AUC for the Ridge (subset) model:", ridge_auc_subset, "\n")
## AUC for the Ridge (subset) model: 0.775

We find an optimal lambda of 0.0126 to 3 significant figures. This is higher than the optimal LASSO penalties, suggesting more aggressive shrinkage is required here, although lambda values are not directly comparable across the LASSO and ridge penalties.

The confusion matrix for our ridge subset shows that 77% of predicted non-arrests and 66% of predicted arrests are correct, with an AUC of 0.775 to 3 significant figures. Whilst these rates match the LASSO subset at our chosen \(C=1\), the AUC is slightly lower than for the LASSO. This suggests that the LASSO’s property of shrinking sparse coefficients exactly to 0, rather than merely towards 0 as ridge does, is slightly more effective here.

However, the differences between models are still very modest so this suggests that they have similar predictive capabilities.

6.4 Tuned Elastic Net

Now we implement a tuned elastic net regression model, choosing to tune for both optimal lambda and alpha to get the optimal weighting between ridge and LASSO models. For this we use our validation data sample as mentioned earlier.
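For reference, glmnet parameterises the elastic net penalty as \[\lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right],\] so that \(\alpha = 1\) recovers the LASSO and \(\alpha = 0\) recovers ridge.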

We loop over a defined set of alpha values from 0 to 1 at spacings of 0.1, for computational efficiency. We use the validation sample to pick the alpha-lambda combination which yields the highest AUC, and then use the testing sample to compute the confusion matrix and AUC for the optimal model.

For computational efficiency, we only run the elastic net regression on the subset:

# Define alpha values for grid search
alpha_values <- seq(0, 1, by = 0.1) 

# Initialize storage for results
results <- data.frame(alpha = alpha_values, auc = NA, optimal_lambda = NA)

# loop grid search
for (i in 1:nrow(results)) {
  alpha_i <- results$alpha[i]
  
  # train en model with cross-validation for the given alpha
  en_opt <- cv.glmnet(
    x = X_train_subset_final,
    y = y_train_final,
    alpha = alpha_i,
    family = "binomial",
    type.measure = "auc"
  )
  
  # get the best lambda for the current alpha
  en_opt_lambda_min <- en_opt$lambda.min
  
  # get fitted probabilities on the validation set using the best lambda and current alpha
  en_opt_predict_val <- as.numeric(predict(en_opt, s = en_opt_lambda_min, newx = X_val_subset, type = "response"))
  
  # compute roc and aucs of trained model in validation data
  en_opt_roc_val <- roc(response = as.numeric(y_val), predictor = en_opt_predict_val, quiet = TRUE)
  en_opt_auc_val <- auc(en_opt_roc_val)
  
  # add to storage vector
  results$auc[i] <- en_opt_auc_val
  results$optimal_lambda[i] <- en_opt_lambda_min
}

# identify the best alpha
best_params <- results[which.max(results$auc), ]
cat("Optimal alpha:", best_params$alpha, "\n")
## Optimal alpha: 0.3
cat("Optimal lambda:", best_params$optimal_lambda, "\n")
## Optimal lambda: 0.02158539
cat("Best AUC:", best_params$auc, "\n")
## Best AUC: 0.731679
# train the tuned elastic net model with best hyperparameters
en_opt <- glmnet(
  x = X_train_subset_final,
  y = y_train_final,
  alpha = best_params$alpha,
  lambda = best_params$optimal_lambda,
  family = "binomial"
)

# get fitted probabilities using trained model on test data
en_opt_predict <- as.numeric(predict(en_opt, s = best_params$optimal_lambda, newx = X_test_subset, type = "response"))

# add variable for predicted classes
en_opt_y_hat <- ifelse(en_opt_predict > 0.5, 1, 0)

# plot confusion matrix for tuned elastic net regression out of sample
generate_cm(y_test, en_opt_y_hat, "Confusion Matrix for Elastic Net (Subset Model)")
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2284         0.751
## 2 1     0           756         0.249
## 3 0     1           181         0.331
## 4 1     1           366         0.669

# compute test ROC and AUC out of sample
en_opt_roc_subset <- roc(response = as.numeric(y_test), predictor = en_opt_predict, quiet = TRUE)
en_opt_auc_subset <- auc(en_opt_roc_subset)
cat("Test AUC for Tuned Elastic Net model:", en_opt_auc_subset, "\n")
## Test AUC for Tuned Elastic Net model: 0.7708506

We find an optimal alpha of 0.3 and an optimal lambda of 0.0216 to 3 significant figures. This implies a heavier weighting towards the ridge penalty, consistent with ridge’s slightly better confusion matrix earlier.

We find that 75% of predicted non-arrests and 67% of predicted arrests are correct, with an AUC of 0.771 to 3 significant figures. This is broadly similar to our previous models; however, it performs worse on AUC than any of them. Its accuracy on predicted arrests is slightly better, but its accuracy on predicted non-arrests slightly worse, than the other shrinkage models.

This is likely due to how we split the training dataset into a training set and a validation set before evaluating on the test data. As a result, the elastic net model has only half as much training data as our previous models. That it performs similarly well despite this suggests it might perform even better if it used the same training data as the previous models.
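The three-way split described above can be sketched as follows; the 50/25/25 proportions and object names here are illustrative rather than the exact ones used in our pipeline:

```r
# Illustrative three-way split: half the data for training, a quarter for
# validation (hyperparameter tuning), and a quarter held out for testing.
set.seed(1)
n <- 1000                      # pretend we have 1,000 rows
idx <- sample(seq_len(n))      # shuffle row indices
train_idx <- idx[1:500]        # 50% final training set
val_idx   <- idx[501:750]      # 25% validation set
test_idx  <- idx[751:1000]     # 25% held-out test set
c(length(train_idx), length(val_idx), length(test_idx))  # 500 250 250
```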

6.5 Random Forest

Next, we implement the random forest algorithm to deal with potential non-linearities in the data, as well as the sparsity that the shrinkage regressions tackle.

We do this on the subset of the data for similar reasons as with our previous regressions. As with the elastic net regression, we only run the random forest on the subset for computational efficiency, since we have a two-way hyperparameter grid search.2 This also means we use the three-way split of the data into training, validation, and testing samples, as with the elastic net.

We tune the maximum number of features of the tree first, refining it to simplify the model. We choose to loop over \(m \in \{1, \dots, 20\}\) features and \(ntree \in \{50, 100, 150\}\), as these provide a good spread of values for a grid search over the parameters whilst preserving computational efficiency.

We run the random forest regression here:

# Initialize parameter grid for grid search
ntrees <- seq(50, 150, 50)  # range of trees to tune over
max_features_range <- 1:20  # Range of mtry  (max features) to tune over

# initialize storage for accuracy metrics
results <- expand.grid(ntree = ntrees, mtry = max_features_range)

dim(results)
## [1] 60  2
results$train_acc <- NA
results$test_acc <- NA
results$oob_acc <- NA

# tune
for (i in 1:nrow(results)) {
  ntrees_i <- results$ntree[i]
  mtry_i <- results$mtry[i]

  # train rf model using training data and current iteration of mtry and ntree
  rf <- randomForest(
    x = X_train_subset_final,
    y = factor(y_train_final),
    mtry = mtry_i,
    ntree = ntrees_i
  )

  # use trained model to predict y in validation data
  y_pred_val <- predict(rf, X_val_subset)
 
  # get trained model predictions of y in training data
  # (note: rf$predicted would give out-of-bag, not in-sample, predictions)
  y_pred_train <- predict(rf, X_train_subset_final)

  # compute out of bag accuracy measure
  results$oob_acc[i] <- 1 - rf$err.rate[ntrees_i, "OOB"]
 
  # compute training accuracy measure
  results$train_acc[i] <- mean(y_train_final == as.numeric(levels(y_pred_train)[y_pred_train]))
 
  # compute "test" - validation - accuracy
  results$test_acc[i] <- mean(y_val == as.numeric(levels(y_pred_val)[y_pred_val]))
}

# identify the optimal parameters
best_params <- results[which.max(results$test_acc), ]
cat("Optimal ntree:", best_params$ntree, "\n")
## Optimal ntree: 100
cat("Optimal mtry:", best_params$mtry, "\n")
## Optimal mtry: 13
# define accuracy metrics and y-axis range for tuning plot
accuracy_metrics <- results[results$ntree == best_params$ntree, ]
y_range <- range(c(accuracy_metrics$train_acc, accuracy_metrics$test_acc, accuracy_metrics$oob_acc)) + c(-0.01, 0.01)

# plot train, validation, and OOB accuracy for optimal ntree as mtry varies
plot(
  accuracy_metrics$mtry, accuracy_metrics$train_acc, 
  type = 'l', col = 'blue', ylim = y_range,
  main = 'Tuning mtry for Optimal ntree',
  xlab = 'Number of Features (mtry)', ylab = 'Accuracy'
)
lines(accuracy_metrics$mtry, accuracy_metrics$test_acc, col = 'red')
lines(accuracy_metrics$mtry, accuracy_metrics$oob_acc, col = 'green', lty = 2)
legend('bottomright', 
       legend = c('Train Accuracy', 'Validation Accuracy', 'OOB Accuracy'),
       col = c('blue', 'red', 'green'), lty = c(1, 1, 2), bty = 'n')

# given optimal parameters, train using training data to get tuned model
RFTuned <- randomForest(
  x = X_train_subset_final,
  y = factor(y_train_final),
  mtry = best_params$mtry,
  ntree = best_params$ntree
)

# using model trained with optimal parameters, predict with TEST data
RFPred <- predict(RFTuned, newdata = X_test_subset)

# compute associated relevant fitted probabilities
RFProb <- predict(RFTuned, newdata = X_test_subset, type = "prob")

# compute test roc and auc of tuned model
RF_roc <- roc(response = factor(y_test), predictor = RFProb[, 2], levels = c(0, 1), quiet = TRUE)
RF_auc <- round(auc(RF_roc), 4)
cat("Random Forest AUC (Subset):", RF_auc, "\n")
## Random Forest AUC (Subset): 0.7919
# gen confusion matrix
generate_cm(
  true = y_test,
  predicted = RFPred,
  title = "Confusion Matrix for Random Forest (Subset)"
)
## # A tibble: 4 × 4
## # Groups:   Predicted [2]
##   True  Predicted  Freq Predicted_pct
##   <fct> <fct>     <int>         <dbl>
## 1 0     0          2271         0.763
## 2 1     0           706         0.237
## 3 0     1           194         0.318
## 4 1     1           416         0.682

We find an optimal number of trees of 100 and an optimal number of features (mtry) of 13.

We find a specificity of 76% and a sensitivity of 68%, with an AUC of 0.792 to 3 significant figures. This is better than all of the previous linear models, despite the same reduced-training-data issue as with the elastic net estimation.

This suggests that the random forest model can capture non-linearities in the data that the linear classification models used previously cannot. These could include interaction effects, which we do not account for explicitly. Police officers may look for patterns of characteristics when arresting someone; for example, they may be more likely to arrest tall, heavy, young Black men.
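To illustrate on synthetic data (not our dataset), a logit misses a pure interaction unless the term is specified by hand, whereas tree-based methods can discover it automatically; the variable names and effect size here are arbitrary:

```r
# Synthetic example: the outcome depends only on an interaction, so a
# main-effects-only logit fits poorly relative to one including the term.
set.seed(42)
n <- 2000
height <- rnorm(n)
age <- rnorm(n)
p <- plogis(1.5 * height * age)   # probability driven purely by interaction
y <- rbinom(n, 1, p)

main_only <- glm(y ~ height + age, family = binomial)
with_int  <- glm(y ~ height * age, family = binomial)
AIC(main_only) > AIC(with_int)    # TRUE: the interaction model fits better
```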

7 Results - Comparison

Finally, we want to visualise and compare the ROC curves and AUCs for all our regressions.

# function to plot ROC curves for multiple models
plot_multiple_roc <- function(roc_list, labels, colors, lwd_list, lty_list, main_title) {
  # Plot the first model as the base plot
  plot(
    x = 1 - roc_list[[1]]$specificities, 
    y = roc_list[[1]]$sensitivities, 
    type = "l",
    col = colors[1], 
    lwd = lwd_list[1], 
    lty = lty_list[1],
    xlim = c(0, 1), 
    ylim = c(0, 1), 
    main = main_title, 
    xlab = "False Positive Rate", 
    ylab = "True Positive Rate",
    cex.lab = 1.5, 
    cex.main = 1.8, 
    cex.axis = 1.2
  )
  
  # Add the remaining ROC curves
  for (i in 2:length(roc_list)) {
    lines(
      x = 1 - roc_list[[i]]$specificities, 
      y = roc_list[[i]]$sensitivities, 
      col = colors[i], 
      lwd = lwd_list[i], 
      lty = lty_list[i]
    )
  }
  
  # Add legend
  legend(
    "bottomright", 
    legend = labels, 
    col = colors, 
    lwd = lwd_list, 
    lty = lty_list, 
    bty = "n"
  )
}

# create list of rocs and associated labels and colors
roc_list <- list(lasso_roc_full, ridge_roc_full, logit_roc, RF_roc, ridge_roc_subset, lasso_roc_subset, en_opt_roc_subset)

labels <- c(
  sprintf("LASSO Full (AUC: %.4f)", lasso_auc_full),
  sprintf("Ridge Full (AUC: %.4f)", ridge_auc_full),
  sprintf("Logistic Regression (AUC: %.4f)", logit_auc),
  sprintf("Random Forest Subset (AUC: %.4f)", RF_auc),
  sprintf("Ridge Subset (AUC: %.4f)", ridge_auc_subset),
  sprintf("LASSO Subset (AUC: %.4f)", lasso_auc_subset),
  sprintf("Elastic Net Subset (AUC: %.4f)", en_opt_auc_subset)
)

colors <- c("red", "darkgreen", "blue", "darkorange", "lightgreen", "pink", "purple")
lwd_list <- c(3, 3, 2, 2, 2, 2, 2)
lty_list <- c(2, 3, 1, 1, 3, 2, 1)


#call
plot_multiple_roc(
  roc_list = roc_list,
  labels = labels,
  colors = colors,
  lwd_list = lwd_list,
  lty_list = lty_list,
  main_title = "ROC Curves for All Models"
)

# make and print sorted table of AUCs
auc_table <- data.frame(
  Model = c(
    "Logistic Regression",
    "LASSO Full",
    "LASSO Subset",
    "Ridge Full",
    "Ridge Subset",
    "Elastic Net Subset",
    "Random Forest Subset"
  ),
  AUC = c(
    logit_auc,
    lasso_auc_full,
    lasso_auc_subset,
    ridge_auc_full,
    ridge_auc_subset,
    en_opt_auc_subset,
    RF_auc
  )
)

auc_table <- auc_table %>% 
  arrange(desc(AUC))

print(auc_table)
##                  Model       AUC
## 1           Ridge Full 0.9464000
## 2           LASSO Full 0.9462000
## 3 Random Forest Subset 0.7919000
## 4         LASSO Subset 0.7771000
## 5  Logistic Regression 0.7751000
## 6         Ridge Subset 0.7750000
## 7   Elastic Net Subset 0.7708506

As discussed before, the two full shrinkage regressions perform exceptionally well out of sample; however, this is likely due to leakage, as some of the included variables may be recorded after an arrest takes place. The other ROC curves are otherwise very similar, although the ROC curve for the random forest estimation clearly stands out as having a better out-of-sample fit than its peers.

7.1 Variable Importance

We then consider the most important variables in each of the logit and shrinkage regressions. An important point to note is that we scale the importance of these variables so that the most important variable has an importance of 1 and the least important 0. A scaled importance of 1 is therefore NOT indicative of multicollinearity or overfitting; it is simply an indication of relative importance.
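The scaling is simply each absolute coefficient divided by the largest absolute coefficient; a toy example with hypothetical coefficients:

```r
# Toy example of the scaling: values land in [0, 1], with the largest
# absolute coefficient mapped to exactly 1.
coefs <- c(x1 = -2.0, x2 = 0.5, x3 = 1.0)
scaled <- abs(coefs) / max(abs(coefs))
scaled  # x1 = 1.00, x2 = 0.25, x3 = 0.50
```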

We show our results here:

# Function to compute and plot variable importance
compute_variable_importance <- function(coefficients, model_name, top_n = 20) {
  importance_df <- data.frame(
    Variable = names(coefficients),
    Scaled_Importance = abs(coefficients) / max(abs(coefficients), na.rm = TRUE)
  ) %>%
    filter(Scaled_Importance > 0) %>%
    arrange(desc(Scaled_Importance)) %>%
    slice(1:top_n)
  
  ggplot(importance_df, aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
    geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
    coord_flip() +
    labs(
      title = paste("Top", top_n, "Var Imp for", model_name),
      x = "Variable",
      y = "Scaled Importance"
    ) +
    theme_minimal()
}

# 1. Logistic Regression
logit_coeff <- coef(logit) 
logit_coeff <- logit_coeff[-1] 
logit_plot <- compute_variable_importance(logit_coeff, "Logistic Regression")
print(logit_plot)

# 2. LASSO Subset
lasso_subset_coeff <- as.numeric(coef(lasso_subset, s = lasso_lambda_min_subset)) 
names(lasso_subset_coeff) <- rownames(coef(lasso_subset, s = lasso_lambda_min_subset))
lasso_subset_coeff <- lasso_subset_coeff[-1] 
lasso_plot <- compute_variable_importance(lasso_subset_coeff, "LASSO Subset")
print(lasso_plot)

# 3. Ridge Subset
ridge_subset_coeff <- as.numeric(coef(ridge_subset, s = ridge_lambda_min_subset)) 
names(ridge_subset_coeff) <- rownames(coef(ridge_subset, s = ridge_lambda_min_subset))
ridge_subset_coeff <- ridge_subset_coeff[-1] 
ridge_plot <- compute_variable_importance(ridge_subset_coeff, "Ridge Subset")
print(ridge_plot)

# 4. Elastic Net Subset

en_subset_coeff <- as.numeric(coef(en_opt, s = best_params$optimal_lambda))
names(en_subset_coeff) <- rownames(coef(en_opt, s = best_params$optimal_lambda))
en_subset_coeff <- en_subset_coeff[-1]
en_plot <- compute_variable_importance(en_subset_coeff, "Elastic Net Subset") 
print(en_plot)

We see that the logistic regression ranks dummies indicating the rank of the officer conducting the stop as the most important variables in predicting arrest, with certain suspected crimes also being highly predictive. The latter makes sense: it is easy to understand why, for certain crimes like terrorism, the police would err on the side of caution and arrest a stopped suspect. It is also easy to understand why certain very specific crimes, like criminal possession of a forged instrument, might only lead to stops when the threshold for arrest has already been surpassed.

On the side of the shrinkage estimators, a couple of key factors show up in each of the three estimators. The officer_explained_stop_flag is a very important predictor, suggesting that an officer explaining why a suspect has been stopped indicates there is good evidence for an arrest, with the officer following protocol to avoid potential legal complications. We also see certain crime and officer-rank dummies featuring highly; the same reasoning as for the logit estimator explains their presence.

Next we consider the most important variables in the random forest estimation:

# Extract Random Forest importance
rf_importance_subset <- data.frame(
  Variable = rownames(RFTuned$importance),
  Scaled_Importance = RFTuned$importance[, "MeanDecreaseGini"] / max(RFTuned$importance[, "MeanDecreaseGini"], na.rm = TRUE)
)
rf_importance_subset <- rf_importance_subset %>%
  arrange(desc(Scaled_Importance))

# plot rf variable importance - top 20
ggplot(rf_importance_subset %>% arrange(desc(Scaled_Importance)) %>% slice(1:20),
       aes(x = reorder(Variable, Scaled_Importance), y = Scaled_Importance)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Top 20 Var Imp for RF",
       x = "Variable",
       y = "Scaled Importance") +
  theme_minimal()

In the random forest model, we still see the officer_explained_stop_flag as relatively important, along with certain suspected-crime dummies; however, these features are far less important than in the linear models explored earlier. What stands out as most important here are demographic variables describing a neighbourhood's racial composition, poverty, and level of education: the percentage of a neighbourhood in poverty, with a Bachelor's degree, or who are Hispanic are among the best predictors of arrest. This is more supportive of the initial assessment that demographic characteristics play an important role in how the police police different neighbourhoods, and of the suggestion that the NYPD have some level of racial bias in how they approach these stops.
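For intuition on the MeanDecreaseGini measure used above: it accumulates, across all trees, the reduction in Gini impurity achieved by splits on each variable. For a binary outcome, a node's impurity is a simple function of its class share:

```r
# Gini impurity of a node where a fraction p of observations are arrests:
# 0 for a pure node, maximised at 0.5 for an evenly mixed node.
gini <- function(p) 2 * p * (1 - p)
c(gini(0), gini(0.5), gini(0.9))  # 0.00 0.50 0.18
```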

8 Conclusion

Our analysis of the NYPD's stop-and-frisk programme in 2023 indicates that whether a police officer makes an arrest is best predicted by a random forest. Our random forest model performs modestly but decisively better than both the shrinkage and logit estimators. This suggests that the process by which a stop turns into an arrest is both modestly sparse and non-linear, which is why the random forest performs better.

Our initial exploration of the dataset's descriptive statistics and the random forest model indicates that racial bias plays an important role both in how the police approach the stop-and-frisk programme and in how they make their arrests. There appears to be a two-way effect. Our exploratory analysis suggests that the police have a more intense presence in poorer neighbourhoods disproportionately inhabited by ethnic minorities. At the same time, the random forest suggests that when the police make a stop in a more affluent and educated neighbourhood, there tends to be stronger grounds to arrest the person stopped. Taken together, the high predictive importance of racial and income variables in the random forest indicates that the police are both more present in poorer minority neighbourhoods and more willing to arrest the individuals they stop there; these individuals are disproportionately poorer ethnic minorities.

A potential extension would be to fit the "full" set of predictors to both the random forest and the tuned elastic net. This would be viable with more computing power, and it would be useful for checking whether the random forest is robust to adding new variables. We could also fit the models on previous years' stop-and-frisk data to see whether their out-of-sample goodness of fit extends to other years.

Despite having a large dataset of over 10,000 observations, even after significant cleaning and feature engineering, our dataset still exhibits high dimensionality due to the large number of predictors. It is therefore unsurprising that some of these predictors exhibit little to no variation. We can see this in the chart showing the percentage of "nulls" in each feature, since many of these "nulls" are actually zeros. This lack of variation is also clear in some of the spatial features, as not all localities in the map have an arrest percentage to report in our exploratory analysis. Whilst we drop multicollinear variables for our logit model, and the shrinkage estimators and random forest are designed to be more robust to a lack of variation in features, there could still be some overfitting due to the sheer number of features with limited variation.
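One practical guard against such features would be to screen for near-zero variance before fitting. A minimal base-R sketch follows; near_zero_var is a hypothetical helper, not part of our pipeline (caret's nearZeroVar offers a more refined version):

```r
# Flag a feature when its most common value covers almost all observations.
near_zero_var <- function(x, threshold = 0.99) {
  max(table(x)) / length(x) >= threshold
}

set.seed(7)
flat   <- c(rep(0, 995), rep(1, 5))   # a dummy that is 99.5% zeros
varied <- rbinom(1000, 1, 0.4)        # a dummy with real variation
c(near_zero_var(flat), near_zero_var(varied))  # TRUE FALSE
```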


  1. In the raw data, the flag’s entries are actually "Y" and "N", but we clean this.↩︎

  2. Run time increases significantly if we tune elastic net and random forest for the full set of predictors and the subset of predictors.↩︎